自然语言转SQL 一个微调ChatGPT3.5的实例

背景

让非技术人员可以从数据库中提问，这是学术界和工业界多年来感兴趣的问题。最近，大型语言模型（LLM）技术（如GPT-4）的进展提高了所提出解决方案的准确性。然而，由于最先进的LLM尚未开放进行微调，因此最近在这一领域的研究集中在创建能够在不修改基础LLM的情况下实现复杂的自然语言到SQL（NL-to-SQL）场景的检索增强生成（RAG）算法。

上周，OpenAI开放了GPT-3.5-turbo供微调使用。在本文中，我们将微调自己的NL-to-SQL模型，并将其性能与最先进的RAG方法进行比较。我们将使用耶鲁大学的Spider数据集作为测试基准。

以下是来自Spider数据集的两个示例：样例。通过简单的处理，我们可以从这个数据集获得以下训练数据：


xxxxxxxxxx
10
1
[
2
  {
3
    "query": "SELECT count(*) FROM singer",
4
    "question": "有多少歌手？"
5
  },
6
  {
7
    "query": "SELECT count(*) FROM singer",
8
    "question": "歌手的总人数是多少？"
9
  }
10
]

准备训练数据集

对于微调GPT-3.5-Turbo，首先要做的是创建和上传训练数据集。由于GPT-3.5-Turbo是一个ChatModel，因此此数据集必须使用以下格式，并被上传为JSONL文件。


xxxxxxxxxx
3
1
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
2
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
3
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}

Spider数据集有一个包含2147个问题/SQL对的测试集，1034个问题/SQL对的开发集和7000个问题/SQL对的训练集。我们将从Spider训练集中构建我们的微调数据集，使用上述结构。

创建训练数据集

一个NL-to-SQL任务的定义如下：给定一个问题和数据库，确定一个SQL查询，当对数据库执行该查询时，返回一个结果集，可以回答这个问题。对于如何更好地提示LLMs执行此任务，研究人员探索了各种方法，并普遍认为提示需要包括指令组件、Data Schema的详细信息、关于数据内容、一组任务特定的演示以及实际的问题。

给定ChatModel训练数据的格式，上述元素必须在以下三个提示中呈现：

system_prompt – 包含指令、Data Schema和数据库内容
user_prompt – 包含自然语言问题
assistant_prompt – 包含SQL查询和推理步骤

让我们看看如何为我们的NL-to-SQL训练数据集创建每个提示。

准备 system_prompt

创建system_prompt是这个任务中最复杂的部分。至少，system_prompt需要包括：

系统指令
Data Schema
数据内容

此外，对于任何实际的用例，如果有大量的表，则训练集中的样本还应该训练模型选择正确的表格用于SQL查询（即执行模式链接）。

准备系统指令

对于指令，我们使用了以下标准提示：


xxxxxxxxxx
3
1
You are an assistant that is an expert in generating Sqlite SQL queries.
2
Having the access to database content, generate a correct Sqlite SQL query for the given question.
3
### Database content ###

Data Schema

对于Data Schema，文献中有许多提出的格式，但没有明确的共识，哪种格式表现最佳。我们发现以下是Data Schema的最佳表示方式：


xxxxxxxxxx
20
1
CREATE TABLE concert (
2
"concert_ID" INTEGER NOT NULL,
3
"concert_Name" TEXT NOT NULL,
4
"Theme" TEXT,
5
"Stadium_ID" TEXT NOT NULL,
6
"Year" TEXT,
7
PRIMARY KEY ("concert_ID"),
8
FOREIGN KEY("Stadium_ID")
9
REFERENCES stadium ("Stadium_ID")
10
)
11
CREATE TABLE singer (
12
"Singer_ID" INTEGER NOT NULL,
13
"Name" TEXT,
14
"Country" TEXT NOT NULL,
15
"Song_Name" TEXT NOT NULL,
16
"Song_release_year" TEXT,
17
"Age" INTEGER,
18
"Is_male" BOOLEAN NOT NULL,
19
PRIMARY KEY ("Singer_ID")
20
)

数据内容

经过多次实验，我们发现以下模板在最佳的训练模型数据内容方面表现最好：


xxxxxxxxxx
24
1
/*
2
Columns in concert and 3 examples in each column for the high cardinality columns :
3
concert_ID: 1025 , 1101 , 1247
4
concert_Name : "Fire", "Dance", "Sky"
5
Stadium_ID : 9, 10, 11
6
*/
7
/*
8
Columns in concert and all categories for the low cardinality columns :
9
Theme : " ROCK ", " POP ", " HIP-HOP "
10
Year : 2022, 2021, 2023, 2020
11
*/
12
/*
13
Columns in singer and 3 examples in each column for the high cardinality columns :
14
Singer_ID : 10235 , 110231 , 1242447
15
Name : "Jordan", "Gabriel", "Tiffany"
16
Country : "Iran", "India", "Canada"
17
Song_Name : "dance in the fire", "rain", "sky"
18
Age : 19, 20, 21
19
*/
20
/*
21
Columns in singer and all categories for the low cardinality columns :
22
Is_male : "MALE", "FEMALE",
23
Song_release_year : 2022, 2021, 2023, 2020
24
*/

数据库内容中的一个重要元素是如何识别分类（低基数）列。区分低基数和高基数列的阈值取决于要微调的大型语言模型（LLM）的上下文窗口大小。给定GPT-3.5-turbo的4096令牌上下文窗口，我们确定20个令牌是低基数和高基数列之间适当的阈值。

模式链接

为了为我们训练集创建正确的system_prompt，需要以这样的方式提供样本，以便训练模型在数据库上正确执行模式链接。为此，我们采用了以下启发式方法：对于每个单独的NL <> SQL样本，我们除了正确的表之外，还随机选择了数据库中的其他表，直到达到4000个令牌的上下文窗口限制为止。为了减少位置信息的影响，我们进一步随机化了表的顺序。简而言之，每个system_prompt包括相关表的模式和内容与其他不相关的表混合在一起，帮助训练模型选择查询的正确表。

现在，我们将所有这些放在一起，构建我们的system_prompt。

对于来自Spider的以下示例：


xxxxxxxxxx
2
1
question："How many heads of the departments are older than 56?"
2
SQL: "SELECT count(*) FROM head WHERE age > 56"

system_prompt将是


xxxxxxxxxx
79
1
You are an assistant that is an expert in generating Sqlite SQL queries.
2
Having the access to database content, generate a correct Sqlite SQL query for the given question.
3
### Database content ###CREATE TABLE trip (
4
id INTEGER, duration INTEGER,
5
start_date TEXT,
6
start_station_name TEXT,
7
start_station_id INTEGER,
8
end_date TEXT,
9
end_station_name TEXT,
10
end_station_id INTEGER,
11
bike_id INTEGER,
12
subscription_type TEXT,
13
zip_code INTEGER,
14
PRIMARY KEY (id)
15
)
16
/* Columns in trip and 3 examples in each column for high cardinality columns :
17
id : 900645, 900752, 900524
18
duration : 1131, 2146, 1155
19
start_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16
20
start_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
21
start_station_id : 56, 65, 49
22
end_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32
23
end_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
24
end_station_id : 56, 65, 49
25
bike_id : 586, 56, 65
26
zip_code : 94070, 94530, 94040–1724
27
*/
28
/* Columns in trip and all categories for low cardinality columns :
29
subscription_type : Customer, Subscriber
30
*/
31

32
CREATE TABLE management (
33
"department_ID" INTEGER,
34
"head_ID" INTEGER,
35
temporary_acting TEXT,
36
PRIMARY KEY ("department_ID", "head_ID"),
37
FOREIGN KEY("head_ID") REFERENCES head ("head_ID"),
38
FOREIGN KEY("department_ID") REFERENCES department ("Department_ID")
39
)
40
/* Columns in management and all categories for low cardinality columns :
41
department_ID : 7, 15, 2, 11
42
head_ID : 5, 4, 6, 3, 10
43
temporary_acting : Yes, No
44
*/
45

46
CREATE TABLE department (
47
"Department_ID" INTEGER,
48
"Name" TEXT,
49
"Creation" TEXT,
50
"Ranking" INTEGER,
51
"Budget_in_Billions" REAL,
52
"Num_Employees" REAL,
53
PRIMARY KEY ("Department_ID")
54
)
55
/* Columns in department and 3 examples in each column for high cardinality columns :
56
Department_ID : 1, 13, 11
57
Name : Energy, Interior, Health and Human Services
58
Creation : 1913, 1979, 1989
59
Ranking : 1, 13, 11
60
Budget_in_Billions : 10.7, 77.6, 59.7
61
Num_Employees : 112557.0, 3000000.0, 235000.0
62
*/
63

64
...
65
CREATE TABLE head (
66
"head_ID" INTEGER,
67
name TEXT,
68
born_state TEXT,
69
age REAL,
70
PRIMARY KEY ("head_ID")
71
)
72
/* Columns in head and all categories for low cardinality columns :
73
head_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9
74
name : Jeff Maggert, Pádraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio García, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham
75
born_state : Delaware, Connecticut, Alabama, California, Florida
76
age : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0
77
*/
78

79
...

user_prompt

用户提示很简单，每个Spider样本的用户问题。例如：


xxxxxxxxxx
1
1
How many heads of the departments are older than 56?

assistant_prompt

助理提示也很简单，包含了来自Spider的相关SQL查询和找到正确列和正确表的推理步骤。为了构建推理步骤，我们只需提取在SQL查询中使用的表和列。例如：


xxxxxxxxxx
4
1
To construct the query, I'll be working with the following tables: head.
2
From these tables, I'll be using the following columns: age.
3
The SQL query I'll be generating is:
4
SELECT count(*) FROM head WHERE age > 56

最终的数据格式


xxxxxxxxxx
16
1
{
2
  "messages": [
3
    {
4
      "role": "system",
5
      "content": "\nYou are an assistant that is an expert in generating sqlite SQL queries.\nHaving the access to database content, generate a correct sqlite SQL query for the given question.\n### Database content ###\n        \nCREATE TABLE trip (\n\tid INTEGER, \n\tduration INTEGER, \n\tstart_date TEXT, \n\tstart_station_name TEXT, \n\tstart_station_id INTEGER, \n\tend_date TEXT, \n\tend_station_name TEXT, \n\tend_station_id INTEGER, \n\tbike_id INTEGER, \n\tsubscription_type TEXT, \n\tzip_code INTEGER, \n\tPRIMARY KEY (id)\n)\n/*\nColumns in trip and 3 examples in each column for high cardinality columns :\nid : 900645, 900752, 900524\nduration : 1131, 2146, 1155\nstart_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16\nstart_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th\nstart_station_id : 56, 65, 49\nend_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32\nend_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th\nend_station_id : 56, 65, 49\nbike_id : 586, 56, 65\nzip_code : 94070, 94530, 94040-1724\n*/\n/*\nColumns in trip and all categories for low cardinality columns :\nsubscription_type : Customer, Subscriber\n*/\n \nCREATE TABLE \"Problems\" (\n\tproblem_id INTEGER, \n\tproduct_id INTEGER NOT NULL, \n\tclosure_authorised_by_staff_id INTEGER NOT NULL, \n\treported_by_staff_id INTEGER NOT NULL, \n\tdate_problem_reported DATETIME NOT NULL, \n\tdate_problem_closed DATETIME, \n\tproblem_description VARCHAR(255), \n\tother_problem_details VARCHAR(255), \n\tPRIMARY KEY (problem_id), \n\tFOREIGN KEY(reported_by_staff_id) REFERENCES \"Staff\" (staff_id), \n\tFOREIGN KEY(product_id) REFERENCES \"Product\" (product_id), \n\tFOREIGN KEY(closure_authorised_by_staff_id) REFERENCES \"Staff\" (staff_id)\n)\n/*\nColumns in Problems and 3 examples in each column for high cardinality columns :\nproblem_id : 1, 13, 11\nclosure_authorised_by_staff_id : 1, 13, 2\ndate_problem_reported : 1995-05-14 08:32:56, 1988-11-07 16:09:31, 1986-11-13 07:30:55\ndate_problem_closed : 1974-09-20 13:42:19, 1997-10-18 20:09:57, 2004-06-20 01:08:25\nproblem_description : d, i, s\n*/\n/*\nColumns in Problems and all categories for low cardinality columns :\nproduct_id : 1, 13, 2, 5, 7, 8, 4, 6, 15\nreported_by_staff_id : 1, 13, 11, 2, 5, 7, 4, 14, 10\nother_problem_details : f, m, i, s, k, l, p, v, c\n*/\n \nCREATE TABLE management (\n\t\"department_ID\" INTEGER, \n\t\"head_ID\" INTEGER, \n\ttemporary_acting TEXT, \n\tPRIMARY KEY (\"department_ID\", \"head_ID\"), \n\tFOREIGN KEY(\"head_ID\") REFERENCES head (\"head_ID\"), \n\tFOREIGN KEY(\"department_ID\") REFERENCES department (\"Department_ID\")\n)\n/*\nColumns in management and all categories for low cardinality columns :\ndepartment_ID : 7, 15, 2, 11\nhead_ID : 5, 4, 6, 3, 10\ntemporary_acting : Yes, No\n*/\n \nCREATE TABLE category (\n\tcategory_id INTEGER NOT NULL, \n\tname VARCHAR(25) NOT NULL, \n\tlast_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL, \n\tPRIMARY KEY (category_id)\n)\n/*\nColumns in category and 3 examples in each column for high cardinality columns :\ncategory_id : 1, 16, 13\nname : Family, Sci-Fi, Action\n*/\n/*\nColumns in category and all categories for low cardinality columns :\nlast_update : 2006-02-15 04:46:27\n*/\n \nCREATE TABLE ship (\n\t\"Ship_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Type\" TEXT, \n\t\"Nationality\" TEXT, \n\t\"Tonnage\" INTEGER, \n\tPRIMARY KEY (\"Ship_ID\")\n)\n/*\nColumns in ship and all categories for low cardinality columns :\nShip_ID : 1, 2, 5, 7, 8, 4, 6, 3\nName : Clan McTavish, Farringford, Appam, Author, Dromonby, Corbridge, Trader, Ariadne\nType : Battle ship, Cargo ship\nNationality : United States, United Kingdom\nTonnage : 3035, 3146, 7781, 3496, 3687, 5816, 3627, 3608\n*/\n \nCREATE TABLE member_attendance (\n\t\"Member_ID\" INTEGER, \n\t\"Performance_ID\" INTEGER, \n\t\"Num_of_Pieces\" INTEGER, \n\tPRIMARY KEY (\"Member_ID\", \"Performance_ID\"), \n\tFOREIGN KEY(\"Performance_ID\") REFERENCES performance (\"Performance_ID\"), \n\tFOREIGN KEY(\"Member_ID\") REFERENCES member (\"Member_ID\")\n)\n/*\nColumns in member_attendance and all categories for low cardinality columns :\nMember_ID : 1, 11, 2, 5, 7, 4, 3\nPerformance_ID : 1, 2, 4, 6, 3\nNum_of_Pieces : 1, 2, 4, 3\n*/\n \nCREATE TABLE department (\n\t\"Department_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Creation\" TEXT, \n\t\"Ranking\" INTEGER, \n\t\"Budget_in_Billions\" REAL, \n\t\"Num_Employees\" REAL, \n\tPRIMARY KEY (\"Department_ID\")\n)\n/*\nColumns in department and 3 examples in each column for high cardinality columns :\nDepartment_ID : 1, 13, 11\nName : Energy, Interior, Health and Human Services\nCreation : 1913, 1979, 1989\nRanking : 1, 13, 11\nBudget_in_Billions : 10.7, 77.6, 59.7\nNum_Employees : 112557.0, 3000000.0, 235000.0\n*/\n\n \nCREATE TABLE chip_model (\n\t\"Model_name\" TEXT, \n\t\"Launch_year\" REAL, \n\t\"RAM_MiB\" REAL, \n\t\"ROM_MiB\" REAL, \n\t\"Slots\" TEXT, \n\t\"WiFi\" TEXT, \n\t\"Bluetooth\" TEXT, \n\tPRIMARY KEY (\"Model_name\")\n)\n/*\nColumns in chip_model and 3 examples in each column for high cardinality columns :\nModel_name : X30 mid-range, X50 Advanced, X51 mid-range\n*/\n/*\nColumns in chip_model and all categories for low cardinality columns :\nLaunch_year : 2002.0, 2005.0, 2004.0, 2003.0\nRAM_MiB : 32.0, 64.0\nROM_MiB : 48.0, 256.0, 128.0, 32.0, 64.0\nSlots : 1CFII,1SD, 1SD\nWiFi : 802.11b, No\nBluetooth : 1.2, Yes, No, 1.1\n*/\n \nCREATE TABLE head (\n\t\"head_ID\" INTEGER, \n\tname TEXT, \n\tborn_state TEXT, \n\tage REAL, \n\tPRIMARY KEY (\"head_ID\")\n)\n/*\nColumns in head and all categories for low cardinality columns :\nhead_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9\nname : Jeff Maggert, Pádraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio García, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham\nborn_state : Delaware, Connecticut, Alabama, California, Florida\nage : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0\n*/\n \nCREATE TABLE mountain (\n\t\"Mountain_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Height\" REAL, \n\t\"Prominence\" REAL, \n\t\"Range\" TEXT, \n\t\"Country\" TEXT, \n\tPRIMARY KEY (\"Mountain_ID\")\n)\n/*\nColumns in mountain and all categories for low cardinality columns :\nMountain_ID : 1, 2, 5, 7, 4, 6, 3\nName : Ngaliema / Mt Stanley (Margherita Pk), Mount Kenya (Lenana), Kibo (Uhuru Pk), Ngaliema / Mt Stanley (Savoia Pk), Mount Kenya (Batian), Duwoni / Mt Speke (Vittorio Emanuele Pk), Mawenzi (Hans Meyer Pk)\nHeight : 5109.0, 5199.0, 5895.0, 4890.0, 4985.0, 4977.0, 5148.0\nProminence : 720.0, 850.0, 3951.0, 3825.0, 130.0, 5885.0, 110.0\nRange : Kilimanjaro, Mount Kenya, Rwenzori\nCountry : DR Congo Uganda, Uganda, Tanzania, Kenya\n*/\n \nCREATE TABLE \"Restaurant_Type\" (\n\t\"ResTypeID\" INTEGER, \n\t\"ResTypeName\" VARCHAR(40), \n\t\"ResTypeDescription\" VARCHAR(100), \n\tPRIMARY KEY (\"ResTypeID\")\n)\n/*\nColumns in Restaurant_Type and all categories for low cardinality columns :\nResTypeID : 1, 2\nResTypeName : Sandwich, Stir-fry\nResTypeDescription : Classic Chinese cooking., Simplest there is.\n*/\n \nCREATE TABLE farm_competition (\n\t\"Competition_ID\" INTEGER, \n\t\"Year\" INTEGER, \n\t\"Theme\" TEXT, \n\t\"Host_city_ID\" INTEGER, \n\t\"Hosts\" TEXT, \n\tPRIMARY KEY (\"Competition_ID\"), \n\tFOREIGN KEY(\"Host_city_ID\") REFERENCES city (\"City_ID\")\n)\n/*\nColumns in farm_competition and all categories for low cardinality columns :\nCompetition_ID : 1, 2, 5, 4, 6, 3\nYear : 2004, 2013, 2005, 2006, 2003, 2002\nTheme : MTV Cube, Valentine's Day, Codehunters, Carnival M is back!, Aliens, MTV Asia Aid\nHost_city_ID : 1, 2, 5, 4, 3\nHosts : Mandy Moore and Ronan Keating, Alicia Keys, Shaggy and Coco Lee, Leehom Wang and Kelly Rowland, Miley Cyrus Jared Leto and Karen Mok, Vanness Wu and Michelle Branch\n*/\n \nCREATE TABLE \"Country\" (\n\tid INTEGER, \n\tname TEXT, \n\tPRIMARY KEY (id)\n)\n/*\nColumns in Country and 3 examples in each column for high cardinality columns :\nid : 1, 19694, 7809\nname : Scotland, Italy, Spain\n*/\n\n \nCREATE TABLE artist (\n\tartist_name TEXT(50) NOT NULL, \n\tcountry TEXT(20), \n\tgender TEXT(20), \n\tpreferred_genre TEXT(50), \n\tCONSTRAINT a_name PRIMARY KEY (artist_name), \n\tFOREIGN KEY(preferred_genre) REFERENCES genre (g_name) ON DELETE CASCADE\n)\n/*\nColumns in artist and all categories for low cardinality columns :\nartist_name : Prity, Michel, Topu, Shrikanta, Enrique, Farida\ncountry : India, UK, USA, Bangladesh\ngender : Male, Female\npreferred_genre : tagore, folk, modern, nazrul, blues, pop\n*/\n \nCREATE TABLE \"Organizations\" (\n\torganization_id INTEGER NOT NULL, \n\tparent_organization_id INTEGER, \n\torganization_details VARCHAR(255), \n\tPRIMARY KEY (organization_id)\n)\n/*\nColumns in Organizations and all categories for low cardinality columns :\norganization_id : 7, 8, 10\nparent_organization_id : 7, 8\norganization_details : Denesik and Sons Party, Reinger, Hudson and Nolan Group, Robel-Schulist Group\n*/\n \nCREATE TABLE school (\n\t\"School_ID\" INTEGER, \n\t\"School\" TEXT, \n\t\"Location\" TEXT, \n\t\"Enrollment\" REAL, \n\t\"Founded\" REAL, \n\t\"Denomination\" TEXT, \n\t\"Boys_or_Girls\" TEXT, \n\t\"Day_or_Boarding\" TEXT, \n\t\"Year_Entered_Competition\" REAL, \n\t\"School_Colors\" TEXT, \n\tPRIMARY KEY (\"School_ID\")\n)\n/*\nColumns in school and all categories for low cardinality columns :\nSchool_ID : 1, 2, 5, 4, 6, 3\nSchool : St Aloysius' College, Cranbrook School, Waverley College, Knox Grammar School, Barker College, Trinity Grammar School\nLocation : Hornsby, Summer Hill, Waverley, Bellevue Hill, Milsons Point, Wahroonga\nEnrollment : 1000.0, 1850.0, 2200.0, 1200.0, 2300.0, 1430.0\nFounded : 1918.0, 1924.0, 1913.0, 1879.0, 1903.0, 1890.0\nDenomination : Catholic, Uniting Church, Anglican\nBoys_or_Girls : Boys only to Yr 9 Co-ed Year 10 to 12, Boys\nDay_or_Boarding : Day, Day & Boarding\nYear_Entered_Competition : 1944.0, 1929.0\nSchool_Colors : Royal Blue and Gold, Black & Blue, Red, White & Blue, Red & Blue, Green and White\n*/\n \nCREATE TABLE flight (\n\tid INTEGER, \n\t\"Vehicle_Flight_number\" TEXT, \n\t\"Date\" TEXT, \n\t\"Pilot\" TEXT, \n\t\"Velocity\" REAL, \n\t\"Altitude\" REAL, \n\tairport_id INTEGER, \n\tcompany_id INTEGER, \n\tPRIMARY KEY (id), \n\tFOREIGN KEY(company_id) REFERENCES operate_company (id), \n\tFOREIGN KEY(airport_id) REFERENCES airport (id)\n)\n/*\nColumns in flight and 3 examples in each column for high cardinality columns :\nid : 1, 13, 11\nVehicle_Flight_number : M2-F1 #14, M2-F1 #61, M2-F1 #0\nDate : July 16, 1965, May 19, 1964, March 28, 1966\n*/\n/*\nColumns in flight and all categories for low cardinality columns :\nPilot : Thompson, Peterson\nVelocity : 240.0, 135.0\nAltitude : 3650.0, 0.0\nairport_id : 1, 2, 5, 8, 4, 6, 3, 9\ncompany_id : 1, 13, 11, 2, 5, 7, 4, 6, 3, 9\n*/\n \nCREATE TABLE \"Type_Of_Restaurant\" (\n\t\"ResID\" INTEGER, \n\t\"ResTypeID\" INTEGER, \n\tFOREIGN KEY(\"ResID\") REFERENCES \"Restaurant\" (\"ResID\"), \n\tFOREIGN KEY(\"ResTypeID\") REFERENCES \"Restaurant_Type\" (\"ResTypeID\")\n)\n/*\nColumns in Type_Of_Restaurant and all categories for low cardinality columns :\nResID : 1, 2\nResTypeID : 1, 2\n*/\n \nCREATE TABLE journalist (\n\t\"journalist_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Nationality\" TEXT, \n\t\"Age\" TEXT, \n\t\"Years_working\" INTEGER, \n\tPRIMARY KEY (\"journalist_ID\")\n)\n/*\nColumns in journalist and 3 examples in each column for high cardinality columns :\njournalist_ID : 1, 11, 2\nName : Jack Meaney, Herbert Swindells, Jackie Waring\n*/\n/*\nColumns in journalist and all categories for low cardinality columns :\nNationality : Northern Ireland, Wales, England\nAge : 37, 28, 25, 33, 34, 43, 27, 29\nYears_working : 1, 5, 7, 8, 21, 6, 3, 12, 10, 9\n*/\n \nCREATE TABLE list (\n\t\"LastName\" TEXT, \n\t\"FirstName\" TEXT, \n\t\"Grade\" INTEGER, \n\t\"Classroom\" INTEGER, \n\tPRIMARY KEY (\"LastName\", \"FirstName\")\n)\n/*\nColumns in list and 3 examples in each column for high cardinality columns :\nLastName : HOUTCHENS, GELL, FLACHS\nFirstName :  RAY,  EMILE,  PATRINA\nClassroom : 109, 110, 106\n*/\n/*\nColumns in list and all categories for low cardinality columns :\nGrade : 1, 2, 5, 4, 6, 3, 0\n*/\n"
6
    },
7
    {
8
      "role": "user",
9
      "content": "How many heads of the departments are older than 56 ?"
10
    },
11
    {
12
      "role": "assistant",
13
      "content": "\nTo construct the query, I'll be working with the following tables: head.\nFrom these tables, I'll be using the following columns: age.\nThe SQL query I'll be generating is:\nSELECT count(*) FROM head WHERE age  >  56"
14
    }
15
  ]
16
}

提交训练集进行微调

一旦我们创建了JSONL文件（可以在这里或ipfs_here找到一个小样本），下一步是使用以下命令将创建的文件上传到OpenAI：


xxxxxxxxxx
2
1
openai.api_key = os.getenv("OPENAI_API_KEY")
2
print(openai.File.create(file=open("spider-finetuning.jsonl", "rb"),purpose='fine-tune'))

上传文件后，可以使用以下命令检查上传的状态：


xxxxxxxxxx
3
1
print(openai.File.retrieve(id="file-id"))
2
# 或者
3
print(openai.File.list())

结果应该类似于：


xxxxxxxxxx
10
1
{
2
"object": "file",
3
"id": "file-id",
4
"purpose": "fine-tune",
5
"filename": "file",
6
"bytes": 71699079,
7
"created_at": 1693343752,
8
"status": "uploaded",
9
"status_details": null
10
}

当状态变为已处理时（类似于下面的示例），您可以将文件用于微调：


xxxxxxxxxx
10
1
{
2
"object": "file",
3
"id": "file-id",
4
"purpose": "fine-tune",
5
"filename": "file",
6
"bytes": 71699079,
7
"created_at": 1693343752,
8
"status": "processed",
9
"status_details": null
10
}

现在，我们准备开始微调作业。可以使用以下python代码创建一个微调作业：


xxxxxxxxxx
8
1
print(openai.FineTuningJob.create(
2
training_file="file-id",
3
model="gpt-3.5-turbo",
4
suffix = "spider",
5
hyperparameters = {
6
"n_epochs": #number_of_epochs,
7
})
8
)

微调过程的持续时间将根据微调数据集的大小而有所不同。微调有一个最大令牌限制，设置为50000000个令牌。因此，在使用Spider数据集时，我们将样本数量从7000个减少到5750个，并进行总共2个时期的微调。

您可以使用以下命令检查微调作业的状态：


xxxxxxxxxx
1
1
print(openai.FineTuningJob.retrieve(id="ftjob-id"))

结果应类似于：


xxxxxxxxxx
19
1
{
2
"object": "fine_tuning.job",
3
"id": "ftjob-id",
4
"model": "gpt-3.5-turbo-0613",
5
"created_at": 1693346245,
6
"finished_at": 1693353313,
7
"fine_tuned_model": "ft:gpt-3.5-turbo-0613:dataherald:spider:id",
8
"organization_id": "org-id",
9
"result_files": [
10
"file-id"
11
],
12
"status": "succeeded",
13
"validation_file": null,
14
"training_file": "file-id",
15
"hyperparameters": {
16
"n_epochs": 2
17
},
18
"trained_tokens": 44722020
19
}

模型性能

DIN-SQL是一种用于自然语言到SQL转换的模型，它在处理Spider数据集上取得了最新的研究成果。DIN-SQL代表"Denoising-inductive SQL generation"，该模型结合了去噪自编码器和归纳式学习的方法，使用GPT-4作为生成器模型来生成SQL查询语句。DIN-SQL使用了各种高级提示技术，包括少量示例提示、思路链提示和分解提示，以提高其性能和准确度。该模型具有较高的准确性和效率，但在成本和处理时间方面可能较高。

我们将微调模型的性能与未经微调的GPT3.5-Turbo和DIN-SQL + GPT-4（Spider的当前最先进方法）进行了基准测试，以获得零-shot性能。

结果如下

微调的G-3.5-Turbo的性能与之前的方法相比进的提示技术，包括少量提示、思维链提示和分解提示）保持一致，这是当前最先进的方法。

关键是，与DIN-SQL + GPT-4方法相比，微调模型显著降低了成本和处理时间。下表提供了每个来自Spider基准的问题的不同模型之间的大致成本和速度。

不同模型每个问题的成本和速度（来自Spider基准）

如上所示，与DIN-SQL与GPT-4相比，微调的GPT-3.5-Turbo模型的成本降低了30倍，速度提高了12倍。

结论

通过投入时间和金钱来构建训练数据集，可以在准确性上与最先进的方法相匹配，同时速度提高12倍，成本降低30倍。如果，针对具体的业务做针对性的微调准确性应该可以进一步提高。

参考：

https://medium.com/dataherald/fine-tuning-gpt-3-5-turbo-for-natural-language-to-sql-4445c1d37f7c